COVID-19 Impact on NYC Property Values: The Residential Density Penalty

Author

socoyjonathan

Introduction

Prior to the pandemic, highly dense neighborhoods commanded price premiums due to close proximity to jobs, services, social spaces, transportation access, and a plethora of amenities. But the pandemic introduced unprecedented challenges: offices closed and transit ridership collapsed, reducing the premium for living near work, while remote work expanded rapidly, increasing demand for the personal space that is scarce in denser areas.

This project asks a specific question: did high residential density become a penalty for property values post-COVID across NYC’s Community Districts, and did this “density penalty” vary by baseline density level? Rather than asking whether dense neighborhoods became “expensive” due to the pandemic, or comparing cities or regions, this analysis examines variation within a single metropolitan housing market: NYC. By holding citywide and labor market conditions constant, this analysis isolates whether density, or neighborhood characteristics correlated with density, shaped post-COVID housing price trajectories.

I first evaluate whether residential density predicts post-COVID housing price changes in isolation and across density levels. Then, I integrate density into a broader multivariable framework alongside education, transit usage, crime, and job accessibility, which not only allows the analysis to distinguish whether density operates independently or merely proxies for other neighborhood characteristics, but also answers the overall question.

Prior Literature and Contribution

Recent research has documented sharp spatial divergence in U.S. housing markets during the COVID-19 pandemic. Hermann and Whitney (2024) show that from 2020 to 2023, home prices grew fastest in lower-density counties, suggesting that the pandemic weakened the traditional density premium as households prioritized space and reduced crowding. While compelling, this evidence operates at a coarse geographic scale: counties often bundle dense urban cores together with suburban and exurban areas, making it difficult to distinguish within-city neighborhood dynamics from broader regional sorting.

This project builds directly on that literature by shifting the unit of analysis to New York City’s 59 community districts. Focusing on a single metropolitan area allows neighborhoods to be compared under identical citywide conditions while preserving sharp variation in residential density, housing stock, income, and demographic composition. This design enables a more granular test of whether density itself became a liability during COVID or whether the observed patterns reflect correlated neighborhood traits.

I develop a hierarchy of regression models, beginning with density alone and progressively adding socioeconomic controls and borough fixed effects. This approach makes it possible to evaluate whether density retains explanatory power once broader geographic structure and neighborhood composition are accounted for.

Data Acquisition and Processing

Load Required Packages
# Suppress package startup messages
suppressPackageStartupMessages({
  library(httr2)        # Modern HTTP requests
  library(sf)           # Spatial data
  library(jsonlite)     # JSON parsing
  library(dplyr)        # Data manipulation
  library(tidyr)        # Data tidying
  library(stringr)      # String operations
  library(readr)        # Data import
  library(janitor)      # Data cleaning
  library(glue)         # String interpolation
  library(lubridate)    # Date handling
  library(scales)       # Formatting
  library(ggplot2)      # Visualization
  library(knitr)        # Tables
  library(broom)        # Model tidying
  library(tidycensus)   # Census API
  library(leaflet)      # Interactive maps
  library(htmlwidgets)  # Widget export
  library(viridis)      # Color palettes
  library(purrr)        # Functional programming
  library(tigris)
  options(tigris_use_cache = TRUE)
  options(dplyr.summarise.inform = FALSE)
})

# Create output directory
if (!dir.exists("output")) dir.create("output", recursive = TRUE)

To construct the analysis dataset, I integrate four primary data sources at the community district level:

  • Official Community District Boundaries
  • PLUTO Tax Lots records
  • Department of Finance Rolling Sales
  • Tract-level American Community Survey

Once I aggregate to districts using spatial methods, my integrated analysis will allow me to study how a relatively fixed neighborhood characteristic relates to housing market changes following a citywide shock.
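The intended integration can be sketched as follows: each source reduces to a CD-level table joined on a shared district identifier. This is a minimal sketch with toy values and illustrative column names; the real tables are constructed in the sections below.

```r
# Minimal sketch of the planned CD-level integration (toy values; the real
# pipeline is built below). cd_id is the shared key across all sources.
library(dplyr)

prices  <- tibble(cd_id = c("MN01", "BK12"), price_change_pct = c(-4.2, 6.1))
density <- tibble(cd_id = c("MN01", "BK12"), pop_density = c(91000, 34000))

analysis <- left_join(prices, density, by = "cd_id")
analysis
```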

Community District Boundaries

Since Community Districts are NYC’s primary unit of neighborhood governance, representing 59 districts across five boroughs, my model will incorporate these boundaries for my geographic units. These boundaries provide an ideal spatial unit: large enough for stable price measures, yet fine enough to capture meaningful neighborhood variation.

Download Community District Boundaries
#' Download NYC Community District Boundaries
#'
#' Purpose:
#'   Retrieves official NYC Community District (CD) boundaries from the
#'   NYC ArcGIS REST API and standardizes them for spatial analysis.
#'
#' Key design choices:
#'   • Local caching to avoid repeated network calls
#'   • Geometry transformed to NYC State Plane (EPSG:2263) for area-accurate analysis
#'   • Creation of a human-readable cd_id (MN01, BK12...)
#'   • Filtering to the 59 official community districts
#'
#' @param dir_path Directory for caching downloaded data
#' @return sf object with CD boundaries and standardized identifiers
get_cd_shapes <- function(dir_path = "data") {
  # Ensure cache directory exists
  if (!dir.exists(dir_path)) dir.create(dir_path, recursive = TRUE)
  # Cache file paths
  fname <- file.path(dir_path, "cd_shapes.rds")
  geojson_file <- file.path(dir_path, "cd_shapes.geojson")
  
  # Use cached version if available
  if (file.exists(fname)) {
    message("Using cached CD shapes: ", fname)
    cd_shapes <- readRDS(fname)
    
    # Check whether the cached data already includes standardized cd_id; if not, create it
    if (!"cd_id" %in% names(cd_shapes)) {
      message("Updating cached version with cd_id column...")
      # Re-project for consistency with spatial analysis
      cd_shapes <- cd_shapes |>
        st_transform("EPSG:2263") |>
        mutate(
          # Numeric borough-district code (... 101, 204)
          boro_cd = as.integer(BoroCD),
          
          # Extract borough digit (1–5)
          boro_num = substr(sprintf("%03d", boro_cd), 1, 1),
          
          # Map borough digit to standard abbreviation
          boro_abbr = case_when(
            boro_num == "1" ~ "MN",
            boro_num == "2" ~ "BX",
            boro_num == "3" ~ "BK",
            boro_num == "4" ~ "QN",
            boro_num == "5" ~ "SI",
            TRUE ~ NA_character_
          ),
          
          # Two-digit community district number (...01, 02)
          cd_num = sprintf("%02d", boro_cd %% 100),
          
          # Final standardized ID (...MN01, QN14)
          cd_id = paste0(boro_abbr, cd_num)
        ) |>
        
        # Remove non-standard districts (>18)
        filter(as.integer(cd_num) <= 18)
      
      # Re-save with cd_id
      saveRDS(cd_shapes, fname)
    }
    
    return(cd_shapes)
  }
  
  message("Downloading Community Districts from NYC ArcGIS API...")
  
  # ArcGIS REST API request:
  tryCatch({
    
    # Request all features and return GeoJSON for sf compatibility
    req <- request("https://services5.arcgis.com/GfwWNkhOj9bNBqoJ/arcgis/rest/services/NYC_Community_Districts/FeatureServer/0/query") |>
      req_url_query(
        where = "1=1",
        outFields = "*",
        outSR = "4326", # WGS84 for API output
        f = "geojson"
      ) |>
      req_headers("User-Agent" = "Mozilla/5.0")
    
    # Execute request and cache raw GeoJSON
    resp <- req_perform(req)
    writeLines(resp_body_string(resp), geojson_file)
    
    # Read and standardize spatial data
    cd_shapes <- st_read(geojson_file, quiet = TRUE) |>
      st_transform("EPSG:2263") |>
      mutate(
        boro_cd = as.integer(BoroCD),
        boro_num = substr(sprintf("%03d", boro_cd), 1, 1),
        boro_abbr = case_when(
          boro_num == "1" ~ "MN",
          boro_num == "2" ~ "BX",
          boro_num == "3" ~ "BK",
          boro_num == "4" ~ "QN",
          boro_num == "5" ~ "SI",
          TRUE ~ NA_character_
        ),
        cd_num = sprintf("%02d", boro_cd %% 100),
        cd_id = paste0(boro_abbr, cd_num)
      ) |>
      # Keep only the 59 official community districts
      filter(as.integer(cd_num) <= 18)
    
    # Data integrity check
    if (nrow(cd_shapes) != 59) {
      warning("Expected 59 CDs, got ", nrow(cd_shapes))
    } else {
      message("Successfully loaded 59 Community Districts")
    }
    
    saveRDS(cd_shapes, fname)
    cd_shapes
    
  }, error = function(e) {
    stop("ERROR downloading CD shapes: ", e$message)
  })
}

# Load CD boundaries (from cache or API)
cd_shapes_raw <- get_cd_shapes()

Data Quality Check:

Because community districts are the primary unit of analysis and I am aggregating housing prices and ACS variables by CD, I verify that the spatial dataset contains the expected 59 districts, spans all boroughs, and uses a consistent coordinate system before proceeding.

Verify CD Data Quality
# Ensure standardized identifiers exist:
# This block guarantees that cd_shapes_raw is in its final,
# analysis-ready form even if it was loaded from an older cache
# or created without necessary fields.
if (!"cd_id" %in% names(cd_shapes_raw)) {
  cd_shapes_raw <- cd_shapes_raw |>
    # Re-project to NYC State Plane for spatial consistency
    st_transform("EPSG:2263") |>
    mutate(
      boro_cd = as.integer(BoroCD),
      boro_num = substr(sprintf("%03d", boro_cd), 1, 1),
      boro_abbr = case_when(
        boro_num == "1" ~ "MN",
        boro_num == "2" ~ "BX",
        boro_num == "3" ~ "BK",
        boro_num == "4" ~ "QN",
        boro_num == "5" ~ "SI",
        TRUE ~ NA_character_
      ),
      cd_num = sprintf("%02d", boro_cd %% 100),
      cd_id = paste0(boro_abbr, cd_num)
    ) |>
    # Restrict to the 59 official community districts
    filter(as.integer(cd_num) <= 18)
}

# Overall integrity summary
tibble(
  Metric = c("Total Community Districts", "Number of Boroughs", "Coordinate System"),
  Value = c(
    as.character(nrow(cd_shapes_raw)),
    as.character(n_distinct(cd_shapes_raw$boro_abbr)),
    st_crs(cd_shapes_raw)$input
  )
) |>
  kable(
    col.names = c("Metric", "Value"),
    caption = "Community District Data Summary",
    align = c("l", "r")
  )
Community District Data Summary
Metric Value
Total Community Districts 59
Number of Boroughs 5
Coordinate System EPSG:2263
Verify CD Data Quality
# Borough-level validation
# Ensures each borough has the expected number of districts and
# that no districts are missing or duplicated after filtering
cd_shapes_raw |>
  st_drop_geometry() |>
  count(boro_abbr, name = "n_districts") |>
  arrange(boro_abbr) |>
  mutate(
    # Expand abbreviations for report readability
    boro_abbr = case_when(
      boro_abbr == "MN" ~ "Manhattan",
      boro_abbr == "BX" ~ "Bronx",
      boro_abbr == "BK" ~ "Brooklyn",
      boro_abbr == "QN" ~ "Queens",
      boro_abbr == "SI" ~ "Staten Island",
      TRUE ~ boro_abbr
    )
  ) |>
  kable(
    col.names = c("Borough", "Districts"),
    caption = "Community Districts by Borough",
    align = c("l", "r")
  )
Community Districts by Borough
Borough Districts
Brooklyn 18
Bronx 12
Manhattan 12
Queens 14
Staten Island 3

PLUTO Data for BBL-CD Crosswalk

Since PLUTO contains comprehensive tax lot information, we can match DOF property sales data to community districts using PLUTO records to build our density analysis across CD neighborhoods.

Download PLUTO Data with Incremental Caching
#' Purpose:
#'   PLUTO is ~850k rows, too large for a single API request.
#'   This function downloads the data in fixed-size chunks and
#'   caches each chunk locally so interrupted runs can resume.
#' 
#' Key aspects:
#'   • Chunked API requests via $limit / $offset
#'   • Per-chunk JSON caching for fault tolerance
#'   • Final RDS cache for fast reloads in later sessions
#'
#' @param dir_path Directory for cached chunks and final RDS
#' @param limit Number of records per API request
#' @return Tibble containing full PLUTO dataset
get_pluto <- function(dir_path = "data", limit = 50000) {
  # Ensure cache directory exists
  if (!dir.exists(dir_path)) dir.create(dir_path, recursive = TRUE)
  # Final combined cache 
  final_file <- file.path(dir_path, "pluto_raw.rds")
  # if final combined cache exists, we skip all API calls
  if (file.exists(final_file)) {
    message("Using cached PLUTO data: ", final_file)
    return(readRDS(final_file))
  }
  
  message("Downloading PLUTO from NYC Open Data API...")
  
  #  NYC Open Data endpoint for PLUTO
  base_url <- "https://data.cityofnewyork.us/resource/64uk-42ks.json"
  offset <- 0            # Controls API pagination
  chunk_index <- 1       # Used for chunk filenames
  all_chunks <- list()   # Holds data before final bind
  
  repeat {
    # cache file for each chunk
    chunk_file <- file.path(dir_path, sprintf("pluto_chunk_%05d.json", chunk_index))
    
    if (!file.exists(chunk_file)) {
      message("Chunk ", chunk_index, " | offset = ", comma(offset))
      
      tryCatch({
        # Construct API request with pagination parameters
        req <- request(base_url) |>
          req_url_query(`$limit` = limit, `$offset` = offset) |>
          req_headers("User-Agent" = "Mozilla/5.0")
        
        # Execute request and save raw JSON to files
        resp <- req_perform(req)
        writeLines(resp_body_string(resp), chunk_file)
        Sys.sleep(1) # careful rate limiting to avoid API throttling
        
      }, error = function(e) {
        stop("ERROR: chunk ", chunk_index, ": ", e$message)
      })
    } else {
      message("Using cached chunk ", chunk_index)
    }
    
    # Read cached chunk
    chunk_data <- fromJSON(chunk_file)
    
    # Empty response = no more records from API
    if (length(chunk_data) == 0 || nrow(chunk_data) == 0) {
      message("No more data")
      break
    }
    
    message("Rows: ", comma(nrow(chunk_data)))
    
    # Store chunk for later binding
    all_chunks[[chunk_index]] <- chunk_data
    
    # If fewer rows than requested, final page
    if (nrow(chunk_data) < limit) break
    
    # Increment pagination state
    offset <- offset + limit
    chunk_index <- chunk_index + 1
  }
  
  # Combine all chunks into one tibble
  pluto_raw <- bind_rows(all_chunks)
  # Save final dataset for fast future reloads
  saveRDS(pluto_raw, final_file)
  message("SUCCESS: ", comma(nrow(pluto_raw)), " rows")
  
  pluto_raw
}

pluto_raw <- get_pluto()

Create BBL-CD Crosswalk:

Because PLUTO data are recorded at the parcel level, I construct a crosswalk that assigns each BBL to a standardized NYC Community District for aggregation and analysis.
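To make the BBL structure concrete, here is a minimal sketch using a hypothetical BBL: one borough digit, a five-digit block, and a four-digit lot, matching the `substr()` positions used in the crosswalk.

```r
# Hypothetical 10-digit BBL: 1 borough digit + 5-digit block + 4-digit lot
bbl <- "1000160100"

boro_digit <- substr(bbl, 1, 1)    # "1" (Manhattan)
block      <- substr(bbl, 2, 6)    # "00016"
lot        <- substr(bbl, 7, 10)   # "0100"

c(boro = boro_digit, block = block, lot = lot)
```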

Create BBL to CD Mapping
#' Create PLUTO to CD Crosswalk
#' Maps Borough-Block-Lot (BBL) identifiers to Community Districts.
#' @param dir_path Directory for cached data
#' @return Tibble mapping BBL → CD
create_pluto_crosswalk <- function(dir_path = "data") {
  # Cached output file
  crosswalk_file <- file.path(dir_path, "pluto_crosswalk.rds")
  
  if (file.exists(crosswalk_file)) {
    message("Using cached PLUTO crosswalk")
    return(readRDS(crosswalk_file))
  }
  
  message("Creating PLUTO crosswalk...")
  
  # Load required datasets
  pluto_raw <- get_pluto(dir_path = dir_path)
  cd_shapes_raw <- get_cd_shapes(dir_path = dir_path)
  
  # Build CD lookup table ~ standardized ID mapping
  cd_lookup <- cd_shapes_raw |>
    st_drop_geometry() |>
    select(boro_cd, cd_id) |>
    distinct()
  
  # Construct crosswalk from PLUTO
  crosswalk <- pluto_raw |>
    # Remove missing or placeholder community district codes
    filter(!is.na(cd), cd != "0", cd != "99") |>
    mutate(
      bbl = as.character(bbl),
      # Remove trailing decimal artifacts from API import
      bbl = str_replace(bbl, "\\.0+$", ""),
      # Convert PLUTO CD field to numeric borough-district code
      boro_cd = as.integer(cd)
    ) |>
    
    # Valid NYC BBLs are exactly 10 digits ~ removes malformed or partial identifiers
    filter(nchar(bbl) == 10) |>
    select(bbl, boro_cd) |>
    
    # Attach standardized CD identifiers
    left_join(cd_lookup, by = "boro_cd") |>
    
    # Drop parcels that could not be matched to a valid CD
    filter(!is.na(cd_id)) |>
    
    # Enforce one-to-one uniqueness
    distinct(bbl, cd_id, boro_cd)
  
  # Confirms there is coverage across the expected number of community districts
  message("Crosswalk covers ", n_distinct(crosswalk$cd_id), " CDs")
  
  # Cache finalized crosswalk
  saveRDS(crosswalk, crosswalk_file)
  crosswalk
}

crosswalk <- create_pluto_crosswalk()

DOF Rolling Sales

The Department of Finance publishes all property sales transactions in New York City. I restrict the data to residential properties (tax classes 1, 2, 2A, 2B, and 2C) with sale prices of at least $10,000 for 2017-2023.

Download DOF Sales Data for 2017-2023
#' Download DOF Rolling Sales Data
#' 
#' Retrieves NYC Department of Finance property sales transactions
#' for selected years using chunked API requests with caching for specified years.
get_dof_sales <- function(years = c(2017:2019, 2021:2023), 
                          dir_path = "data", 
                          limit = 50000) {
  
  if (!dir.exists(dir_path)) dir.create(dir_path, recursive = TRUE)
  
  final_file <- file.path(dir_path, "sales_raw.rds")
  
  if (file.exists(final_file)) {
    message("Using cached sales data: ", final_file)
    return(readRDS(final_file))
  }
  
  message("Downloading DOF Annualized Sales from NYC Open Data API...")
  
  # NYC Open Data endpoint for DOF Rolling Sales
  base_url <- "https://data.cityofnewyork.us/resource/w2pb-icbu.json"
  all_sales <- list()
  
  # Loop over years to control size of API call
  for (yr in years) {
    message("Year: ", yr)
    
    offset <- 0
    chunk_index <- 1
    year_chunks <- list()
    
    repeat {
      # Per-year, per-chunk cache file
      chunk_file <- file.path(dir_path, sprintf("sales_%d_chunk_%03d.json", yr, chunk_index))
      
      if (!file.exists(chunk_file)) {
        message("Chunk ", chunk_index, " | offset = ", comma(offset))
        
        tryCatch({
          # API request with a date-bounded WHERE clause (SoQL, a SQL-like query syntax)
          req <- request(base_url) |>
            req_url_query(
              `$limit` = limit,
              `$offset` = offset,
              `$where` = sprintf("sale_date >= '%d-01-01' AND sale_date <= '%d-12-31'", yr, yr)
            ) |>
            req_headers("User-Agent" = "Mozilla/5.0")
          
          # Execute request and cache raw JSON
          resp <- req_perform(req)
          writeLines(resp_body_string(resp), chunk_file)
          Sys.sleep(1)
          
        }, error = function(e) {
          stop("ERROR downloading chunk ", chunk_index, " for year ", yr, ": ", e$message)
        })
      } else {
        message("Using cached chunk ", chunk_index)
      }
      
      chunk_data <- fromJSON(chunk_file)
      
      if (length(chunk_data) == 0 || nrow(chunk_data) == 0) {
        message("No more data for ", yr)
        break
      }
      
      message("Rows: ", comma(nrow(chunk_data)))
      year_chunks[[chunk_index]] <- chunk_data
      
      if (nrow(chunk_data) < limit) break
      
      offset <- offset + limit
      chunk_index <- chunk_index + 1
    }
    # Combine chunks for this year
    if (length(year_chunks) > 0) {
      all_sales[[as.character(yr)]] <- bind_rows(year_chunks)
      message("Total for ", yr, ": ", comma(nrow(all_sales[[as.character(yr)]])), " rows")
    }
  }
  # Combine all years into one dataset
  sales_raw <- bind_rows(all_sales)
  saveRDS(sales_raw, final_file)
  message("SUCCESS: ", comma(nrow(sales_raw)), " rows")
  
  sales_raw
}

sales_raw <- get_dof_sales()

Process and Match Sales to CDs

Inspecting the data, we note that property sales are recorded at the parcel level, so each sale must be matched to a CD. I use a two-stage matching strategy built on the PLUTO crosswalk: it prioritizes exact parcel identifiers and falls back to block-level inference when necessary, preparing the dataset for analysis at the CD level.

Clean Sales Data and Match to CDs
#' Process Sales Data with CD Matching
#' 
#' Uses two matching strategies:
#' 1. Exact BBL match (preferred)
#' 2. Block-level match (fallback)
process_sales_data <- function(dir_path = "data") {
  
  message("\n=== Processing Sales Data ===")
  
  # Sales provide parcel-level transactions
  sales_raw <- get_dof_sales(dir_path = dir_path)
  # the crosswalk provides a standardized mapping from BBLs to community districts
  crosswalk <- create_pluto_crosswalk(dir_path = dir_path)
  
  # Clean sales data: drop sale prices below $10,000 and restrict to residential sales,
  # while ensuring identifier integrity
  sales_clean <- sales_raw |>
    mutate(
      bbl = as.character(bbl),
      bbl = str_replace(bbl, "\\.0+$", ""),
      sale_price = as.numeric(sale_price),
      sale_date = as.Date(sale_date),
      tax_class = as.character(tax_class_at_time_of_sale),
      year = as.integer(format(sale_date, "%Y"))
    ) |>
    filter(
      # valid NYC BBL length
      nchar(bbl) == 10,
      !is.na(sale_price),
      # exclude non-market / nominal transfers
      sale_price >= 10000,
      !is.na(sale_date),
      !is.na(tax_class),
      tax_class %in% c("1", "2", "2A", "2B", "2C") # residential properties only
    )
  
  # Strategy 1: Exact BBL-to-CD match
  sales_matched_exact <- sales_clean |>
    inner_join(crosswalk, by = "bbl")
  
  n_exact <- nrow(sales_matched_exact)
  message("Exact BBL matches: ", comma(n_exact))
  

  # Identify remaining unmatched sales using anti_join
  sales_unmatched <- sales_clean |>
    anti_join(crosswalk, by = "bbl")
  
  # Strategy 2: Block-level inference 
  # When an exact parcel match fails, assign CDs based on the most
  # common CD observed for parcels on the same borough–block
  if (nrow(sales_unmatched) > 0) {
    block_lookup <- crosswalk |>
      mutate(
        block = as.integer(substr(bbl, 2, 6)),
        boro_digit = substr(bbl, 1, 1)
      ) |>
      count(boro_digit, block, cd_id, boro_cd) |>
      group_by(boro_digit, block) |>
      # dominant CD per block
      slice_max(n, n = 1, with_ties = FALSE) |>
      ungroup() |>
      select(boro_digit, block, cd_id, boro_cd)
    
    sales_matched_block <- sales_unmatched |>
      mutate(
        block = as.integer(substr(bbl, 2, 6)),
        boro_digit = substr(bbl, 1, 1)
      ) |>
      inner_join(block_lookup, by = c("boro_digit", "block")) |>
      select(-boro_digit, -block)
    
    n_block <- nrow(sales_matched_block)
    message("Block-level matches: ", comma(n_block))
    
    # Combine exact and inferred matches
    sales_with_cd <- bind_rows(sales_matched_exact, sales_matched_block)
  } else {
    sales_with_cd <- sales_matched_exact
    n_block <- 0
  }
  
  # Make sure CD assignment is mostly done by strategy 1: BBL->CD
  total_matched <- nrow(sales_with_cd)
  match_rate <- total_matched / nrow(sales_clean)
  
  message("Total matched: ", comma(total_matched))
  message("Match rate: ", percent(match_rate, accuracy = 0.1))
  
  sales_with_cd
}

sales_with_cd <- process_sales_data()

ACS Density Data

To measure pre-COVID residential density at the community district level, I aggregate tract-level ACS estimates for population and housing units using area-weighted spatial intersection to account for partial overlaps between census tracts and district boundaries along with median income as a control.
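As a worked illustration of the area weighting (hypothetical numbers): a tract whose land area straddles two districts contributes its population to each in proportion to the overlap.

```r
# Hypothetical tract of 4,000 residents whose land area overlaps two CDs
tract_pop <- 4000
weight    <- c(CD_A = 0.7, CD_B = 0.3)   # overlap_area / tract_area per CD

allocated <- tract_pop * weight
allocated                                # CD_A = 2800, CD_B = 1200
# The allocation preserves the tract total: sum(allocated) == tract_pop
```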

Download and Aggregate ACS Density Data to CDs
#' Download ACS Density Data and Aggregate to CDs
#' 
#' Uses area-weighted spatial intersection to aggregate tract-level
#' ACS data to Community Districts.
get_acs_cd_density <- function(end_year, cd_sf = NULL, period_label = NULL,
                                dir_path = "data") {
  
  if (!dir.exists(dir_path)) dir.create(dir_path, recursive = TRUE)
  
  cache_file <- file.path(dir_path, sprintf("acs_density_%s.rds", end_year))
  
  if (file.exists(cache_file)) {
    message("Using cached ACS density: ", end_year)
    base <- readRDS(cache_file)
    if (!is.null(period_label)) {
      base <- base |> mutate(period = period_label)
    }
    return(base)
  }
  
  message("Downloading ACS density data (", end_year, " 5-year)...")
  
  # Ensure CD geometries are available and standardized
  if (is.null(cd_sf)) {
    cd_shapes_raw <- get_cd_shapes(dir_path = dir_path)
    cd_sf <- cd_shapes_raw
    if (!"cd_id" %in% names(cd_sf)) {
      cd_sf <- cd_sf |>
        mutate(
          boro_cd = as.integer(BoroCD),
          boro_num = substr(sprintf("%03d", boro_cd), 1, 1),
          boro_abbr = case_when(
            boro_num == "1" ~ "MN",
            boro_num == "2" ~ "BX",
            boro_num == "3" ~ "BK",
            boro_num == "4" ~ "QN",
            boro_num == "5" ~ "SI",
            TRUE ~ NA_character_
          ),
          cd_num = sprintf("%02d", boro_cd %% 100),
          cd_id = paste0(boro_abbr, cd_num)
        ) |>
        filter(as.integer(cd_num) <= 18)
    }
  }
  
  # ACS variable selection
  # Population and housing units are used for density measures;
  # income is included for socioeconomic context
  acs_vars <- c(
    total_population = "B01003_001",
    total_housing_units = "B25001_001",
    median_income = "B19013_001"
  )
  
  # Download tract-level ACS with geometry
  # Tracts are the finest geography for consistent NYC-wide ACS data
  acs_tracts <- suppressMessages(
    get_acs(
      geography = "tract",
      variables = acs_vars,
      year = end_year,
      survey = "acs5",
      state = "NY",
      county = c("005", "047", "061", "081", "085"),
      geometry = TRUE,
      cache_table = TRUE,
      output = "wide"
    )
  )
  
  acs_tracts <- acs_tracts |>
    transmute(
      geoid = GEOID,
      total_population = total_populationE,
      total_housing_units = total_housing_unitsE,
      median_income = median_incomeE,
      geometry = geometry
    ) |>
    filter(!is.na(total_population), total_population > 0)
  # Geometry validation and CRS alignment
  acs_tracts <- st_make_valid(acs_tracts)
  cd_sf <- st_make_valid(cd_sf)
  cd_sf <- cd_sf |> st_transform(st_crs(acs_tracts))
  
  message("Intersecting tracts with CDs...")
  inter <- suppressWarnings(
    st_intersection(acs_tracts, cd_sf |> select(cd_id))
  )
  
  # Spatial intersection
  # Creates tract–CD fragments for area-weighted aggregation
  inter <- inter |>
    mutate(inter_area = as.numeric(st_area(geometry)))
  
  # Area-weight calculation
  # Each tract contributes proportionally to CDs based on overlap area
  tract_areas <- acs_tracts |>
    transmute(geoid, tract_area = as.numeric(st_area(geometry))) |>
    st_drop_geometry()
  
  # Weight each tract's contribution by the fraction of its area overlapping the CD
  inter <- inter |>
    st_drop_geometry() |>
    left_join(tract_areas, by = "geoid") |>
    mutate(weight = if_else(tract_area > 0, inter_area / tract_area, 0))
  
  message("Aggregating to CD level...")
  
  # Area-weighted aggregation to CDs
  # Preserves population totals while correctly allocating partial tracts
  cd_density <- inter |>
    mutate(
      wt_population = total_population * weight,
      wt_housing_units = total_housing_units * weight,
      wt_income_pop = median_income * total_population * weight,
      wt_pop_for_income = total_population * weight
    ) |>
    group_by(cd_id) |>
    summarise(
      total_population = sum(wt_population, na.rm = TRUE),
      total_housing_units = sum(wt_housing_units, na.rm = TRUE),
      weighted_median_income = sum(wt_income_pop, na.rm = TRUE) / sum(wt_pop_for_income, na.rm = TRUE),
      .groups = "drop"
    ) |>
    arrange(cd_id)
  
  message("Calculating density metrics...")
  
  # CDs land area
  cd_areas <- cd_sf |>
    st_transform("EPSG:2263") |>
    mutate(land_area_sq_mi = as.numeric(st_area(geometry)) / 27878400) |>
    st_drop_geometry() |>
    select(cd_id, land_area_sq_mi)
  
  # Population density and housing density
  cd_density <- cd_density |>
    left_join(cd_areas, by = "cd_id") |>
    mutate(
      pop_density_per_sq_mi = total_population / land_area_sq_mi,
      housing_density_per_sq_mi = total_housing_units / land_area_sq_mi,
      acs_year = end_year
    )
  
  n_cds <- n_distinct(cd_density$cd_id)
  # Confirms all 59 community districts are represented
  if (n_cds != 59) {
    warning("ACS density has ", n_cds, " CDs (expected 59)")
  } else {
    message("ACS density covers all 59 CDs")
  }
  
  saveRDS(cd_density, cache_file)
  message("Saved to: ", cache_file)
  
  if (!is.null(period_label)) {
    cd_density <- cd_density |> mutate(period = period_label)
  }
  
  cd_density
}

#  establish my baseline density level for covid penalty analysis
cd_density <- get_acs_cd_density(
  end_year = 2019,
  period_label = "baseline"
)

Build Analysis Dataset

After cleaning and matching all data sources to community districts, I construct a standardized community-district dataset that compares property prices and density levels across the defined pre-COVID (2017-2019) and post-COVID (2021-2023) periods, excluding 2020 to avoid extreme short-term pandemic disruptions.

Construct CD-Level Analysis Dataset
#' Build CD Sales Panel
build_sales_panel <- function(pre_years = c(2017, 2018, 2019),
                               post_years = c(2021, 2022, 2023),
                               dir_path = "data") {
  
  message("\n=== Building CD Sales Panel ===")
  
  sales_with_cd <- process_sales_data(dir_path = dir_path)
  
  cd_panel <- sales_with_cd |>
    mutate(
      period = case_when(
        year %in% pre_years ~ "pre_covid",
        year %in% post_years ~ "post_covid",
        TRUE ~ NA_character_
      )
    ) |>
    filter(!is.na(period)) |>
    group_by(cd_id, boro_cd, year, period) |>
    summarise(
      n_sales = n(),
      median_price = median(sale_price, na.rm = TRUE),
      mean_price = mean(sale_price, na.rm = TRUE),
      .groups = "drop"
    ) |>
    arrange(year, cd_id)
  
  message("Sales panel: ", n_distinct(cd_panel$cd_id), " CDs, ", nrow(cd_panel), " rows")
  
  cd_panel
}

cd_sales_panel <- build_sales_panel()

# Create analysis dataset
nyc_cd <- get_cd_shapes()

# Verify structure
stopifnot("cd_id must exist" = "cd_id" %in% names(nyc_cd))

cd_analysis_df <- cd_sales_panel |>
  mutate(
    period_group = if_else(period == "pre_covid", "pre", "post")
  ) |>
  group_by(cd_id, period_group) |>
  summarise(
    avg_median_price = mean(median_price, na.rm = TRUE),
    total_sales = sum(n_sales, na.rm = TRUE),
    .groups = "drop"
  ) |>
  pivot_wider(
    names_from = period_group,
    values_from = c(avg_median_price, total_sales),
    names_glue = "{.value}_{period_group}"
  ) |>
  mutate(
    price_change_pct = ((avg_median_price_post - avg_median_price_pre) / avg_median_price_pre) * 100
  )

# Add density data
cd_analysis_df <- cd_analysis_df |>
  left_join(
    cd_density |>
      select(
        cd_id,
        pop_density_2019 = pop_density_per_sq_mi,
        housing_density_2019 = housing_density_per_sq_mi,
        median_income_2019 = weighted_median_income,
        total_pop_2019 = total_population,
        total_housing_2019 = total_housing_units
      ),
    by = "cd_id"
  )

# Create density terciles
density_tercile_breaks <- quantile(
  cd_analysis_df$pop_density_2019,
  probs = c(0, 1/3, 2/3, 1),
  na.rm = TRUE
)

cd_analysis_df <- cd_analysis_df |>
  mutate(
    density_tercile = cut(
      pop_density_2019,
      breaks = density_tercile_breaks,
      labels = c("Low", "Medium", "High"),
      include.lowest = TRUE
    ),
    borough = case_when(
      substr(cd_id, 1, 2) == "MN" ~ "Manhattan",
      substr(cd_id, 1, 2) == "BX" ~ "Bronx",
      substr(cd_id, 1, 2) == "BK" ~ "Brooklyn",
      substr(cd_id, 1, 2) == "QN" ~ "Queens",
      substr(cd_id, 1, 2) == "SI" ~ "Staten Island",
      TRUE ~ NA_character_
    )
  )

message("Analysis dataset complete: ", nrow(cd_analysis_df), " CDs")

price_summary <- summary(cd_analysis_df$price_change_pct)

Data Analysis

Empirical Strategy

This analysis evaluates whether residential density became a penalty for property values following the COVID-19 pandemic and whether any observed “density penalty” reflects density itself or correlated neighborhood characteristics. Because COVID represents a sharp, citywide shock, the analysis exploits variation in how New York City’s Community Districts (CDs) changed relative to one another rather than attempting to estimate an absolute causal effect.

My analysis proceeds sequentially, moving from descriptive correlation to multivariable explanation: a descriptive tercile analysis, a bivariate density regression, controlled and fixed-effects specifications, and finally an integrated team model. This structure allows me to assess whether density independently predicts post-COVID price changes or whether it proxies for deeper socioeconomic mechanisms.


Outcome Variable

First, the outcome of interest is the percentage change in median residential property prices at the community district level, comparing pre-COVID (2017–2019) to post-COVID (2021–2023) periods. Year 2020 is excluded to avoid short-term pandemic disruptions and abnormal transaction patterns.

Formally, for each community district ( i ):

\[ \Delta P_i = \frac{\overline{P}_{i,\text{post}} - \overline{P}_{i,\text{pre}}}{\overline{P}_{i,\text{pre}}} \times 100 \]

where prices are computed using arm's-length residential transactions exceeding $10,000.
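The outcome construction above can be sketched in a few lines of base R. This is a minimal illustration on hypothetical sale records for a single district; the real analysis aggregates DOF rolling sales across all CDs.

```r
# Minimal sketch of the outcome variable on hypothetical sales records
# (toy data; the real analysis uses DOF rolling sales aggregated by CD)
sales <- data.frame(
  cd_id      = rep("MN01", 4),
  year       = c(2018, 2019, 2021, 2022),
  sale_price = c(800000, 820000, 900000, 940000)
)

# Keep arm's-length residential sales above the $10,000 threshold
sales <- subset(sales, sale_price > 10000)

# Pre- and post-COVID median prices (2020 excluded by construction)
p_pre  <- median(sales$sale_price[sales$year %in% 2017:2019])
p_post <- median(sales$sale_price[sales$year %in% 2021:2023])

delta_p <- (p_post - p_pre) / p_pre * 100
round(delta_p, 1)  # ≈ 13.6 for this toy district
```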


Density Measures

Baseline residential density is measured using the 2019 ACS 5-year estimates, aggregated from census tracts to community districts via area-weighted spatial intersection. Two density measures are initially considered:

  • Population density (people per square mile)
  • Housing density (housing units per square mile)

Because these measures capture related aspects of neighborhood structure, including both in a single regression may introduce multicollinearity and obscure interpretation. To determine which measure to retain, I assess multicollinearity using variance inflation factors (VIF).

Variance Inflation Check for Density Measures
library(car)

# Fit model including both density measures
vif_model <- lm(
  price_change_pct ~
    pop_density_2019 +
    housing_density_2019,
  data = cd_analysis_df
)

# Compute VIFs and build tidy table explicitly
vif_raw <- car::vif(vif_model)

vif_table <- tibble(
  variable = names(vif_raw),
  vif = as.numeric(vif_raw)
) |>
  mutate(
    variable = case_when(
      variable == "pop_density_2019" ~ "Population Density (2019)",
      variable == "housing_density_2019" ~ "Housing Density (2019)",
      TRUE ~ variable
    ),
    vif = round(vif, 2),
    multicollinearity = case_when(
      vif >= 10 ~ "Severe",
      vif >= 5  ~ "Moderate",
      TRUE      ~ "Low"
    )
  ) |>
  arrange(desc(vif))

# Display table
vif_table |>
  kable(
    col.names = c("Variable", "VIF", "Multicollinearity Level"),
    caption = "Variance Inflation Factors for Alternative Density Measures",
    align = c("l", "r", "l")
  )
Variance Inflation Factors for Alternative Density Measures
Variable VIF Multicollinearity Level
Population Density (2019) 5.36 Moderate
Housing Density (2019) 5.36 Moderate

VIF results indicate moderate multicollinearity between population density and housing density, so I retain population density as the primary measure of residential density, as it more directly reflects crowding and interpersonal exposure relevant to the COVID pandemic.


Step 1: Descriptive Analysis

I begin with a descriptive analysis using histograms and density terciles to assess whether high-density community districts appear to have experienced weaker post-COVID price growth. This step establishes whether a density–price pattern exists but does not control for confounding factors.

Visualizing the distribution of property value changes demonstrates how outcomes varied substantially across NYC's 59 community districts. The median district experienced moderate price growth of 15.9%, but the range spans from large declines (-15.3%) in some districts to substantial appreciation (57.2%) in others.

Distribution of Property Value Changes Across Community Districts

Baseline population density across CDs ranges from 6,134 to 94,421 people per square mile—from suburban Staten Island to the ultra-dense parts of Manhattan.

Distribution of Population Density Across Community Districts

Partitioning CDs into density terciles reveals a clear gradient: low-density areas averaged 19.6% price growth compared to just 15.4% in high-density areas.

Price Change by Density Tercile
Density Tercile N Median Pop. Density Median Housing Density Median Price Change (%) Mean Price Change (%) SD of Price Change (%)
Low 20 22,430 8,019 20.4% 19.6% 9.6%
Medium 19 39,997 16,890 16.1% 17.5% 15.5%
High 20 64,552 27,140 9.4% 15.4% 16.7%

Our exploratory analysis suggests that higher-density CDs experienced weaker post-COVID price growth than lower-density districts.


Regressions

Model 1: Density Regression

To quantify the raw association between density and post-COVID price change, I estimate a bivariate regression:

\[ \Delta P_i = \alpha + \beta \cdot \text{Density}_{i,2019} + \varepsilon_i \]

This model captures the unconditional correlation between density and price change and provides a baseline benchmark against which more complex models are evaluated.

Regression of Price Change on Baseline Density
# Simple regression: unconditional correlation between density and price change
price_density_model <- lm(price_change_pct ~ pop_density_2019, data = cd_analysis_df)

# Display regression results
tidy_regression(
  price_density_model,
  caption = "Model 1: Bivariate Relationship Between Density and Price Change"
)
Model 1: Bivariate Relationship Between Density and Price Change
Variable Coefficient Std. Error t-stat p-value CI Lower CI Upper
Intercept 20.9690 4.0659 5.1573 < 0.001 12.8272 29.1108
Population Density (2019) -0.0001 0.0001 -0.9528 0.3447 -0.0002 0.0001
m1_density_coef <- coef(price_density_model)["pop_density_2019"]
m1_density_pval <- summary(price_density_model)$coefficients["pop_density_2019", "Pr(>|t|)"]
m1_r2 <- summary(price_density_model)$r.squared

In the bivariate regression, population density has a coefficient of -0.0000791793 (p = 0.345), indicating that a 1,000-person-per-square-mile increase in population density is associated with an approximately 0.079 percentage point decrease in post-COVID property price growth. This relationship is not statistically significant at the 5% level. The model explains 1.6% of the variation in price changes across community districts.
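The per-1,000-people interpretation above comes from rescaling the raw per-person coefficient. A sketch on simulated data shows that rescaling the predictor before fitting yields the same quantity directly; the actual model is Model 1 fit on `cd_analysis_df`.

```r
# Sketch: rescale density to thousands of people per square mile so the
# slope reads directly in percentage points per 1,000 people.
# Simulated stand-in data, not the actual analysis dataset.
set.seed(1)
df <- data.frame(
  price_change_pct = rnorm(59, mean = 18, sd = 10),
  pop_density_2019 = runif(59, 6000, 95000)
)

m_raw    <- lm(price_change_pct ~ pop_density_2019, data = df)
m_scaled <- lm(price_change_pct ~ I(pop_density_2019 / 1000), data = df)

# The scaled slope is exactly 1,000 times the raw per-person slope
all.equal(unname(coef(m_scaled)[2]), unname(coef(m_raw)[2]) * 1000)  # TRUE
```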

Price Performance Map
# ==============================================================================
# VISUALIZATION: LEAFLET MAP – RELATIVE PRICE PERFORMANCE
# ==============================================================================
library(leaflet)
library(scales)

# --- Join geometry to analysis data ---
map_data <- nyc_cd |>
  left_join(cd_analysis_df, by = "cd_id") |>
  filter(!is.na(price_change_pct))

# --- Transform to WGS84 for Leaflet (CRITICAL!) ---
map_data <- map_data |> st_transform(4326)

# --- Compute citywide average price change ---
citywide_mean <- mean(map_data$price_change_pct, na.rm = TRUE)

# --- Compute deviation and category ---
map_data <- map_data |>
  mutate(
    deviation_from_avg = price_change_pct - citywide_mean,
    performance_category = case_when(
      deviation_from_avg <= -10 ~ "Strong Underperformance (< -10%)",
      deviation_from_avg > -10 & deviation_from_avg <= -5 ~ "Moderate Underperformance (-5% to -10%)",
      deviation_from_avg > -5 & deviation_from_avg < 5 ~ "Near Citywide Average (±5%)",
      deviation_from_avg >= 5 & deviation_from_avg < 10 ~ "Moderate Outperformance (+5% to +10%)",
      deviation_from_avg >= 10 ~ "Strong Outperformance (> +10%)"
    ),
    performance_category = factor(
      performance_category,
      levels = c(
        "Strong Underperformance (< -10%)",
        "Moderate Underperformance (-5% to -10%)",
        "Near Citywide Average (±5%)",
        "Moderate Outperformance (+5% to +10%)",
        "Strong Outperformance (> +10%)"
      )
    )
  )


# --- Color palette using named vector (like the working code) ---
performance_colors <- c(
  "Strong Underperformance (< -10%)" = "#d7191c",
  "Moderate Underperformance (-5% to -10%)" = "#fdae61",
  "Near Citywide Average (±5%)" = "#ffffbf",
  "Moderate Outperformance (+5% to +10%)" = "#a6d96a",
  "Strong Outperformance (> +10%)" = "#1a9641"
)

performance_pal <- colorFactor(
  palette = performance_colors,
  domain = map_data$performance_category
)

# --- Create popup text ---
map_data <- map_data |>
  mutate(
    popup_text = paste0(
      "<strong>", cd_id, " (", borough, ")</strong><br>",
      "<strong>Price Change: ", round(price_change_pct, 1), "%</strong><br>",
      "Deviation from NYC Avg: ", round(deviation_from_avg, 1), "%<br>",
      "NYC Average: ", round(citywide_mean, 1), "%<br>",
      "Population Density (2019): ",
      comma(round(pop_density_2019, 0)), " ppl/sq mi<br>",
      "Density Tercile: ", density_tercile
    )
  )

# --- Create the leaflet map ---
map <- leaflet(map_data) |>
  addProviderTiles(providers$CartoDB.Positron) |>
  addPolygons(
    fillColor = ~performance_pal(performance_category),
    fillOpacity = 0.7,
    color = "white",
    weight = 2,
    opacity = 1,
    highlight = highlightOptions(
      weight = 3,
      color = "#666",
      fillOpacity = 0.9,
      bringToFront = TRUE
    ),
    popup = ~popup_text,
    label = ~paste0(cd_id, ": ", round(deviation_from_avg, 1), "%"),
    labelOptions = labelOptions(
      style = list("font-weight" = "normal", padding = "3px 8px"),
      textsize = "12px",
      direction = "auto"
    )
  ) |>
  addLegend(
    position = "bottomright",
    pal = performance_pal,
    values = ~performance_category,
    title = "Post-COVID Price Performance<br>(vs NYC Average)",
    opacity = 0.7
  ) |>
  addControl(
    html = paste0(
      "<div style='background:white; padding:10px; border-radius:5px;'>
        <strong>Post-COVID Price Performance</strong><br>
        NYC Average: ", round(citywide_mean, 1), "%<br>
        <em>Descriptive comparison only</em>
      </div>"
    ),
    position = "topright"
  ) |>
  setView(lng = -73.95, lat = 40.7, zoom = 10)

# Display the map
map

This map illustrates post-COVID housing price changes relative to the citywide average across community districts. While higher-density districts appear more likely to underperform descriptively, this pattern reflects raw differences rather than causal effects; subsequent regression analysis shows it largely reflects broader socioeconomic and geographic structure rather than an independent density effect.

Model 2: Controlled Individual Model

To test whether density predicts price changes independently of pre-COVID confounders, I estimate a cross-sectional regression:

\[ \Delta P_i = \alpha + \beta \cdot \text{Density}_{i,2019} + \gamma \cdot \text{Income}_{i,2019} + \delta \cdot \text{Housing Stock}_{i,2019} + \varepsilon_i \]

where income and housing stock are measured prior to COVID to avoid post-treatment bias. This model evaluates whether the supposed ‘density penalty’ persists once baseline socioeconomic conditions are held constant.

Regression of Price Change on Baseline Density and Baseline Controls
# Cross-sectional regression:
# Result: pre–post percentage change in median prices
# Predictor: baseline population density (2019)
# Controls: pre-COVID income and housing stock
price_density_model_controls <- lm(
  price_change_pct ~ 
    pop_density_2019 + 
    median_income_2019 + 
    total_housing_2019,
  data = cd_analysis_df
)

# Display regression results
tidy_regression(
  price_density_model_controls,
  caption = "Model 2: Density and Baseline Controls"
)
Model 2: Density and Baseline Controls
Variable Coefficient Std. Error t-stat p-value CI Lower CI Upper
Intercept 36.0929 6.5451 5.5145 < 0.001 22.9761 49.2096
Population Density (2019) -0.0001 0.0001 -1.2630 0.2119 -0.0003 0.0001
Median Income (2019) -0.0002 0.0001 -3.7395 < 0.001 -0.0003 -0.0001
Total Housing Units (2019) 0.0000 0.0001 -0.0248 0.9803 -0.0002 0.0002
m2_density_coef <- coef(price_density_model_controls)["pop_density_2019"]
m2_density_pval <- summary(price_density_model_controls)$coefficients["pop_density_2019", "Pr(>|t|)"]
m2_income_coef <- coef(price_density_model_controls)["median_income_2019"]
m2_income_pval <- summary(price_density_model_controls)$coefficients["median_income_2019", "Pr(>|t|)"]
m2_r2 <- summary(price_density_model_controls)$r.squared

In this controlled regression, baseline population density retains a negative coefficient, but the effect is not statistically significant once pre-COVID income and housing stock are included (p = 0.21). In contrast, median income is strongly negatively associated with price growth, indicating that higher-income districts experienced systematically weaker price growth over this period.

After controlling for median income and housing stock, the density coefficient becomes -0.0000975842 (p = 0.212), still not statistically significant. The controls amplify the magnitude of the density coefficient by 23.2%, suggesting that income differences partially masked, rather than explained, the raw density–price relationship. Model fit improves to R² = 0.229.

Model 3: Density Terciles

Furthermore, to examine whether the density penalty is concentrated at extreme levels or operates across the full density distribution, I estimate a complementary model using density terciles constructed from baseline population density, enabling comparison across low-, medium-, and high-density districts.

\[ \Delta P_i = \alpha + \beta_1 \cdot \text{Medium}_{i} + \beta_2 \cdot \text{High}_{i} + \gamma \cdot \text{Income}_{i} + \delta \cdot \text{Housing}_{i} + \varepsilon_i \]

where low-density districts serve as the reference category.

Regression Using Density Terciles
# Tercile-based regression:
# Low-density districts serve as the reference group
price_density_tercile_model <- lm(
  price_change_pct ~ 
    density_tercile + 
    median_income_2019 + 
    total_housing_2019,
  data = cd_analysis_df
)

tidy_regression(
  price_density_tercile_model,
  caption = "Model 3: Density Terciles with Baseline Controls"
)
Model 3: Density Terciles with Baseline Controls
Variable Coefficient Std. Error t-stat p-value CI Lower CI Upper
Intercept 34.5869 6.6022 5.2387 < 0.001 21.3503 47.8235
Density: Medium (ref = Low) -1.6135 4.1512 -0.3887 0.699 -9.9362 6.7092
Density: High (ref = Low) -4.6604 4.0883 -1.1399 0.2593 -12.8570 3.5362
Median Income (2019) -0.0002 0.0001 -3.6255 < 0.001 -0.0003 -0.0001
Total Housing Units (2019) 0.0000 0.0001 -0.1893 0.8506 -0.0002 0.0002
# Extract statistics
m3_medium_coef <- coef(price_density_tercile_model)["density_tercileMedium"]
m3_medium_pval <- summary(price_density_tercile_model)$coefficients["density_tercileMedium", "Pr(>|t|)"]

m3_high_coef <- coef(price_density_tercile_model)["density_tercileHigh"]
m3_high_pval <- summary(price_density_tercile_model)$coefficients["density_tercileHigh", "Pr(>|t|)"]

m3_r2 <- summary(price_density_tercile_model)$r.squared

Using density terciles with low-density districts as the reference group, the regression estimates the following differences in post-COVID property price growth:

In the tercile-based regression, medium-density districts have a coefficient of -1.614 (p = 0.699), indicating that medium-density districts experienced approximately 1.6 percentage points lower price growth relative to low-density districts. This difference is not statistically significant at the 5% level.

High-density districts exhibit a coefficient of -4.66 (p = 0.259), implying that high-density districts experienced approximately 4.7 percentage points lower price growth compared to low-density districts. This effect is not statistically significant at the 5% level.

Taken together, the tercile coefficients decline monotonically from low- to high-density districts, but neither estimate is statistically distinguishable from zero, so density terciles do not independently predict price changes once sampling variability is accounted for. Overall model fit is modest, with an R² of 0.226, indicating that density category alone explains a limited share of cross-district variation in price changes.

Density Tercile Choropleth
# ==============================================================================
# VISUALIZATION: DENSITY TERCILE CHOROPLETH
# ==============================================================================
# --- Join geometry to analysis data ---
map_data <- nyc_cd |>
  left_join(cd_analysis_df, by = "cd_id") |>
  filter(!is.na(density_tercile))

# --- Transform to WGS84 for Leaflet (CRITICAL!) ---
map_data <- map_data |> st_transform(4326)

# --- Ensure density_tercile is a factor with fixed order ---
map_data <- map_data |>
  mutate(
    density_tercile = factor(
      density_tercile,
      levels = c("Low", "Medium", "High")
    )
  )

# --- Color palette using named vector (like the working code) ---
density_colors <- c(
  "Low" = "#4575b4",
  "Medium" = "#fee090",
  "High" = "#d73027"
)

density_pal <- colorFactor(
  palette = density_colors,
  domain = map_data$density_tercile
)

# --- Create popup text ---
map_data <- map_data |>
  mutate(
    popup_text = paste0(
      "<strong>", cd_id, "</strong><br>",
      "Density Tercile: ", density_tercile, "<br>",
      "Population Density (2019): ",
      comma(round(pop_density_2019, 0)), " people/sq mi"
    )
  )

# --- Create leaflet map ---
density_map <- leaflet(map_data) |>
  addProviderTiles(providers$CartoDB.Positron) |>
  addPolygons(
    fillColor = ~density_pal(density_tercile),
    fillOpacity = 0.75,
    color = "white",
    weight = 1.5,
    opacity = 1,
    highlight = highlightOptions(
      weight = 3,
      color = "#666",
      fillOpacity = 0.9,
      bringToFront = TRUE
    ),
    popup = ~popup_text,
    label = ~paste0(cd_id, ": ", density_tercile),
    labelOptions = labelOptions(
      style = list("font-weight" = "normal", padding = "3px 8px"),
      textsize = "12px",
      direction = "auto"
    )
  ) |>
  addLegend(
    position = "bottomright",
    pal = density_pal,
    values = ~density_tercile,
    title = "Population Density (2019)",
    opacity = 0.8
  ) |>
  addControl(
    html = "<div style='background:white; padding:8px; border-radius:4px;'>
              <strong>Density Terciles</strong><br>
              Based on 2019 population per square mile
            </div>",
    position = "topright"
  ) |>
  setView(lng = -73.95, lat = 40.7, zoom = 10)

# Display the map
density_map

This map shows the spatial distribution of baseline residential density across New York City’s community districts, illustrating the strong clustering of high-density districts in Manhattan and inner Brooklyn/Queens that motivates the borough fixed effects regression that follows.

Model 4: Borough Fixed Effects

Because density varies sharply across boroughs, especially between Manhattan and the outer boroughs, I further estimate a regression model including borough fixed effects:

\[ \Delta P_i = \alpha + \beta \cdot \text{Density}_{i,2019} + \gamma \cdot \text{Income}_{i,2019} + \delta \cdot \text{Housing Stock}_{i,2019} + \eta \cdot \text{Borough}_i + \varepsilon_i \]

This regression tests whether density operates within boroughs, or whether it simply captures broad geographic differences:

Regression with Borough Fixed Effects
# Make Staten Island the reference category
cd_analysis_df <- cd_analysis_df |>
  mutate(borough = factor(borough, 
                          levels = c("Staten Island", "Bronx", "Brooklyn", 
                                   "Manhattan", "Queens")))

# Add borough fixed effects to test whether density proxies for Manhattan
price_density_borough_fe <- lm(
  price_change_pct ~
    pop_density_2019 +
    median_income_2019 +
    total_housing_2019 +
    borough,
  data = cd_analysis_df
)

tidy_regression(
  price_density_borough_fe,
  caption = "Model 4: Density with Borough Fixed Effects (Reference: Staten Island)"
)
Model 4: Density with Borough Fixed Effects (Reference: Staten Island)
Variable Coefficient Std. Error t-stat p-value CI Lower CI Upper
Intercept 26.1980 9.9905 2.6223 0.0115 6.1412 46.2547
Population Density (2019) -0.0001 0.0001 -0.7104 0.4807 -0.0003 0.0001
Median Income (2019) -0.0001 0.0001 -1.6366 0.1079 -0.0002 0.0000
Total Housing Units (2019) 0.0001 0.0001 0.8663 0.3904 -0.0001 0.0003
Borough FE: Bronx 9.5828 8.8855 1.0785 0.2859 -8.2555 27.4212
Borough FE: Brooklyn -5.9354 8.0962 -0.7331 0.4668 -22.1891 10.3183
Borough FE: Manhattan -9.7786 9.6129 -1.0172 0.3138 -29.0772 9.5201
Borough FE: Queens -5.3173 7.7402 -0.6870 0.4952 -20.8564 10.2218
# Extract borough effects with Staten Island as reference
bronx_vs_si <- coef(price_density_borough_fe)["boroughBronx"]
brooklyn_vs_si <- coef(price_density_borough_fe)["boroughBrooklyn"]
manhattan_vs_si <- coef(price_density_borough_fe)["boroughManhattan"]
queens_vs_si <- coef(price_density_borough_fe)["boroughQueens"]

# Extract p-values
bronx_pval <- summary(price_density_borough_fe)$coefficients["boroughBronx", "Pr(>|t|)"]
manhattan_pval <- summary(price_density_borough_fe)$coefficients["boroughManhattan", "Pr(>|t|)"]
queens_pval <- summary(price_density_borough_fe)$coefficients["boroughQueens", "Pr(>|t|)"]
brooklyn_pval <- summary(price_density_borough_fe)$coefficients["boroughBrooklyn", "Pr(>|t|)"]

Using Staten Island as the reference category, the borough fixed effects capture how post-COVID property price growth differed across boroughs after controlling for population density, median income, and housing stock.

Relative to Staten Island, Manhattan, Brooklyn, and Queens show lower estimated price growth, while the Bronx shows higher estimated growth; however, none of these differences are statistically significant at the α = 0.05 level (all p-values > 0.25).

Once borough-level differences are accounted for, neither residential density nor borough identity exhibits a statistically significant independent effect on post-COVID price changes. This indicates that the supposed “density penalty” observed in simpler models does not survive controls for broader geographic structure, and that post-COVID price divergence is driven by correlated neighborhood characteristics rather than density itself.
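The claim that neither density nor borough identity is individually significant can be complemented by a joint F-test on the borough dummies. The sketch below uses simulated stand-in data; in the actual analysis the comparison would be between `price_density_model_controls` and `price_density_borough_fe`.

```r
# Sketch of a joint F-test: do the borough fixed effects improve fit
# beyond the continuous controls? (Simulated data, not the analysis
# dataset; variable names mirror the models fit above.)
set.seed(42)
df <- data.frame(
  price_change_pct   = rnorm(59, 18, 12),
  pop_density_2019   = runif(59, 6000, 95000),
  median_income_2019 = runif(59, 30000, 120000),
  borough = factor(sample(
    c("Staten Island", "Bronx", "Brooklyn", "Manhattan", "Queens"),
    59, replace = TRUE
  ))
)

m_base <- lm(price_change_pct ~ pop_density_2019 + median_income_2019, data = df)
m_fe   <- lm(price_change_pct ~ pop_density_2019 + median_income_2019 + borough,
             data = df)

# anova() reports the F-statistic and p-value for the borough block jointly
anova(m_base, m_fe)
```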

Model 5: Integration with Team Analysis

The Overarching Question

Did COVID-19 reshape the relationship between neighborhood characteristics and property values across NYC’s Community Districts (CDs)?

My descriptive analysis and Model 1 suggested that high-density community districts experienced weaker post-COVID price growth in isolation, though the bivariate relationship was not statistically significant. Moreover, density is strongly correlated with many other neighborhood features, as Models 2–4 reflected. To determine which deeper structural forces density reflects, I integrated my density measure into a multivariable model incorporating all five team-studied dimensions:

  • Residential density
  • Educational attainment (percent BA+)
  • Transit ridership change
  • Crime rate change
  • Job accessibility change

\[ \Delta P_i = \alpha + \beta_1 \cdot \text{Density}_{i,2019} + \beta_2 \cdot \text{Education}_{i,2019} + \beta_3 \cdot \Delta\text{Transit}_i + \beta_4 \cdot \Delta\text{Crime}_i + \beta_5 \cdot \Delta\text{Jobs}_i + \varepsilon_i \]

This integrated approach allows us to understand whether COVID reshaped housing values through physical form or through who lives and works in neighborhoods.
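The integrated specification can be sketched as below. The column names follow those used later when tidying `team_model`; the data frame here (`team_analysis_df`) is simulated stand-in data with a hypothetical name, not the actual team dataset.

```r
# Sketch of the integrated team specification on simulated data.
# Column names mirror those referenced when the model is tidied later;
# `team_analysis_df` is a hypothetical stand-in.
set.seed(7)
team_analysis_df <- data.frame(
  price_change_pct     = rnorm(51, 17, 11),
  pop_density_2019     = runif(51, 6000, 95000),
  pct_ba_plus_2019     = runif(51, 10, 80),
  ridership_change_pct = runif(51, -50, -30),
  crime_pchange        = runif(51, -25, 20),
  jobs_change_pct      = runif(51, -45, 5)
)

team_model <- lm(
  price_change_pct ~
    pop_density_2019 +
    pct_ba_plus_2019 +
    ridership_change_pct +
    crime_pchange +
    jobs_change_pct,
  data = team_analysis_df
)

# One intercept plus five neighborhood-characteristic coefficients
summary(team_model)$coefficients
```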

Team Data Integration

Team Analysis Dataset (First 10 Community Districts)
Community District Price Change (%) Population Density (2019) BA+ Attainment (%) Transit Ridership Change (%) Crime Rate Change (%) Job Accessibility Change (%)
BK01 21.66 39997.45 47.71 -34.00 2.31 -28.83
BK02 19.86 43692.71 66.86 -44.02 -13.34 -38.32
BK03 10.56 59900.58 37.79 -43.53 -20.83 -38.92
BK04 9.49 55429.11 30.57 -38.46 15.49 1.16
BK05 38.58 32814.08 14.24 -46.57 3.77 -32.43
BK06 15.96 37335.60 72.90 -41.10 6.52 -39.46
BK07 7.93 32476.31 35.03 -34.24 16.50 -26.28
BK08 21.79 62329.13 45.51 -37.66 -0.85 -32.92
BK09 0.56 61788.15 34.85 -40.12 -18.30 -31.60
BK10 12.41 31416.72 42.43 -34.90 -3.68 -40.04

Team Model Results

Team Model: All Neighborhood Characteristics → Price Change
Variable Coefficient Std Error t-stat p-value CI Lower CI Upper
Intercept 2.1138 17.0049 0.1243 0.9016 -32.1359 36.3635
Residential Density (2019) 0.0000 0.0001 -0.0357 0.9717 -0.0002 0.0002
BA+ Attainment % (2019) -0.4235 0.0984 -4.3050 < 0.001 -0.6216 -0.2253
Transit Ridership Change % -0.6618 0.3519 -1.8803 0.0665 -1.3706 0.0471
Crime Rate Change % -0.0283 0.0975 -0.2902 0.773 -0.2246 0.1680
Job Accessibility Change % -0.1315 0.2609 -0.5040 0.6167 -0.6570 0.3940

Educational Attainment (% with BA or Higher)

Educational attainment is the strongest and only statistically significant predictor in the model (β = -0.4235, p < 0.001). Each one–percentage point increase in BA+ attainment is associated with a 0.423 percentage point decrease in post-COVID price growth, holding other factors constant. This suggests that socioeconomic composition, rather than physical density or amenities, was the dominant driver of post-COVID housing market divergence.

Residential Density (Population per Square Mile)

Population density remains statistically insignificant, indicating that density does not independently predict post-COVID price changes once education, transit, crime, and job accessibility are accounted for.

Transit Ridership Change

Transit ridership recovery is negatively associated with price growth (β = -0.6618), but the estimate falls just short of significance at the 5% level (p = 0.0665). Changes in transit usage do not independently explain price trajectories after controlling for education, crime, job accessibility, and density.

Crime Rate Change

Changes in crime rates show no statistically significant association with post-COVID price growth (β = -0.0283, p = 0.773), indicating that neighborhood safety conditions did not operate as an independent mechanism affecting property values during this period.

Job Accessibility Change

Job accessibility changes are also not statistically significant (β = -0.1315, p = 0.617), suggesting that traditional job–housing linkages weakened during COVID, consistent with the expansion of remote work.

Model Fit Comparison
model_comparison <- tibble(
  Model = c(
    "Model 1: Density Only",
    "Model 2: + Controls",
    "Model 3: Density Terciles",
    "Model 4: + Borough FE",
    "Model 5: Multivariable Regression"
  ),
  N = c(
    nobs(price_density_model),
    nobs(price_density_model_controls),
    nobs(price_density_tercile_model),
    nobs(price_density_borough_fe),
    nobs(team_model)
    
  ),
  R_squared = c(
    summary(price_density_model)$r.squared,
    summary(price_density_model_controls)$r.squared,
    summary(price_density_tercile_model)$r.squared,
    summary(price_density_borough_fe)$r.squared,
    summary(team_model)$r.squared
  ),
  Adj_R_squared = c(
    NA,
    summary(price_density_model_controls)$adj.r.squared,
    summary(price_density_tercile_model)$adj.r.squared,
    summary(price_density_borough_fe)$adj.r.squared,
    summary(team_model)$adj.r.squared
    
  )
) |>
  mutate(
    across(where(is.numeric), ~ round(., 3))
  )

model_comparison |>
  kable(
    caption = "Model Fit Comparison Across Density Specifications",
    align = c("l", "r", "r", "r", "r")
  )
Model Fit Comparison Across Density Specifications
Model N R_squared Adj_R_squared
Model 1: Density Only 59 0.016 NA
Model 2: + Controls 59 0.229 0.187
Model 3: Density Terciles 59 0.226 0.169
Model 4: + Borough FE 59 0.386 0.302
Model 5: Multivariable Regression 51 0.368 0.298

The full team model explains 36.8% of the variation in post-COVID property price changes across NYC community districts (Adjusted R² = 0.298, N = 51), indicating that post-COVID housing outcomes reflect multiple correlated neighborhood characteristics rather than any single characteristic alone. The coefficient estimates are visualized below:

Multivariable Regression Visualization
# Tidy team model for plotting
team_model_tidy <- tidy(team_model, conf.int = TRUE) |>
  filter(term != "(Intercept)") |>
  mutate(
    term = case_when(
      term == "pop_density_2019" ~ "Population Density (2019)",
      term == "pct_ba_plus_2019" ~ "BA+ Attainment (%)",
      term == "ridership_change_pct" ~ "Transit Ridership Change (%)",
      term == "crime_pchange" ~ "Crime Rate Change (%)",
      term == "jobs_change_pct" ~ "Job Accessibility Change (%)",
      TRUE ~ term
    ),
    significant = if_else(p.value < 0.05, "Yes", "No")
  )

# Coefficient plot (rendered inline)
ggplot(
  team_model_tidy,
  aes(
    x = reorder(term, estimate),
    y = estimate,
    color = significant
  )
) +
  geom_point(size = 4) +
  geom_errorbar(
    aes(ymin = conf.low, ymax = conf.high),
    width = 0.2,
    linewidth = 1
  ) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "gray50") +
  coord_flip() +
  scale_color_manual(
    values = c("No" = "gray60", "Yes" = "#E31A1C")
  ) +
  labs(
    title = "Predictors of Post-COVID Property Value Change",
    subtitle = "Multivariable Model Across NYC Community Districts",
    x = NULL,
    y = "Estimated Effect on Price Change (%)",
    color = "Statistically Significant (p < 0.05)"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    legend.position = "bottom",
    plot.title = element_text(face = "bold", size = 14),
    plot.subtitle = element_text(size = 11),
    panel.grid.minor = element_blank()
  )

Summary of Findings

The multivariable team model reveals that educational attainment is the strongest driver of post-COVID property value changes among all the neighborhood characteristics examined, consistent with pandemic-era patterns favoring remote-work-capable populations.

The supposed “density penalty” I investigated actually reflects who lives in dense neighborhoods, not density itself. COVID-19 reshaped NYC’s housing market primarily through demographic and occupational sorting rather than through physical crowding, transit dependence, crime, or job proximity.

Limitations and Scope

This analysis remains observational. Although the pandemic provides a sharp temporal break, unobserved time-varying factors, such as changing preferences for schools, social spaces, or neighborhood services, may still confound the regressions. Additionally, density and education are measured using 2019 baselines and do not capture the extreme short-term population shifts during COVID. Finally, community districts remain aggregate units that may obscure block-level variation within neighborhoods.

Nevertheless, the community-district design, regression hierarchy, and integration of multiple neighborhood mechanisms provide further evidence of the persistent impacts of COVID-19 on NYC’s housing market.

Conclusion

Overall, the COVID-19 pandemic did not fundamentally overturn the dynamics of urban density within New York City. While high-density neighborhoods experienced weaker price growth in raw comparisons, this pattern does not reflect an independent penalty associated with residential density. Instead, post-COVID price changes were driven primarily by who lives in specific neighborhoods and where those neighborhoods are located, not by crowding itself.

For future research, it would be constructive to investigate:

  • whether educated workers are permanently relocating
  • how return-to-office requirements may affect preferences
  • whether less-educated neighborhoods can attract educated remote workers
  • and how patterns vary in cities with different remote work policies

Data Sources and References

New York City Department of Finance.
New York City Rolling Sales Data. NYC Open Data Portal. https://data.cityofnewyork.us/City-Government/DOF-Property-Sales/w2pb-icbu

New York City Department of City Planning.
Primary Land Use Tax Lot Output (PLUTO). https://data.cityofnewyork.us/City-Government/Primary-Land-Use-Tax-Lot-Output-PLUTO-/64uk-42ks
New York City Community District Boundaries. https://services5.arcgis.com/GfwWNkhOj9bNBqoJ/arcgis/rest/services/NYC_Community_Districts

U.S. Census Bureau.
American Community Survey (ACS) 5-Year Estimates, 2015–2019. https://www.census.gov/programs-surveys/acs

U.S. Census Bureau.
Longitudinal Employer–Household Dynamics (LEHD) Origin–Destination Employment Statistics (LODES). https://lehd.ces.census.gov/data/

Metropolitan Transportation Authority (MTA).
Annual Subway Ridership Statistics. https://www.mta.info/open-data

New York University Furman Center.
NYC Crime Indicators Database. https://furmancenter.org/

Hermann, A., & Whitney, P. (2024).
The Geography of Pandemic-Era Home Price Trends and the Implications for Affordability. Harvard Joint Center for Housing Studies.